The goal of this data exploration is to find out which chemical characteristics influence red wine quality. Which properties make a red wine good?

names(rw)
##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
summary(rw)
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

Univariate Plots Section

Quality Distribution

Each expert graded wine quality as a discrete number between 0 (very bad) and 10 (excellent). In this data set the grades actually given range from 3 to 8, with a median of 6.
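
A minimal sketch of how such a distribution can be tabulated. The `quality` vector here is a small hypothetical stand-in for `rw$quality`, and base graphics are used only to keep the example dependency-free.

```r
# Hypothetical stand-in for rw$quality; the real column holds 1,599 grades.
quality <- c(5, 5, 6, 7, 5, 6, 6, 3, 8, 6, 5, 4, 6)

# Count wines per grade; levels 0..10 keep the unused grades visible.
quality_counts <- table(factor(quality, levels = 0:10))
print(quality_counts)

# Range and median of the grades that were actually given.
quality_range <- range(quality)
quality_median <- median(quality)

# Bar chart of the counts.
barplot(quality_counts, xlab = "quality", ylab = "count")
```

With the real data, replacing `quality` by `rw$quality` should give the same kind of table for all 1,599 wines.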

Distribution of chemical features (properties)

New variables

Long-tailed and skewed features can be transformed toward a more normal distribution with a square root or log function. For example, the “sulphates” feature:

Original

Log

Square root

Both transformations look closer to a normal distribution than the original, and the log-scaled feature looks the most normal of the three.
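
One way to put a number on "looks more normal" is to compare sample skewness before and after each transformation. The sketch below uses synthetic log-normal draws as a hypothetical stand-in for `rw$sulphates`; the `skewness` helper is defined here and is not taken from the original analysis.

```r
set.seed(42)
# Hypothetical right-skewed stand-in for rw$sulphates (log-normal draws).
sulphates <- rlnorm(1000, meanlog = log(0.62), sdlog = 0.25)

# Sample skewness: third central moment divided by the
# second central moment raised to the power 1.5.
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

skew_by_scale <- c(
  original = skewness(sulphates),
  sqrt     = skewness(sqrt(sulphates)),
  log10    = skewness(log10(sulphates))
)
print(round(skew_by_scale, 3))
```

A skewness near zero indicates a roughly symmetric distribution. Because the square root and the log are both concave, each step from original to square root to log lowers the skewness.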

Univariate Analysis

What is the structure of your dataset?

This “Red Wine” data set contains 1,599 observations of 13 variables: 11 chemical properties (features) of the wine, the “quality” score, and a row index “X”.

Distributions of attributes

  • Normal: Volatile acidity, Density, pH
  • Positively Skewed: Fixed acidity, Citric acid, Free sulfur dioxide, Total sulfur dioxide, Sulphates, Alcohol
  • Long Tail: Residual sugar, Chlorides

Attribute information:

   For more information, read [Cortez et al., 2009].

   Input variables (based on physicochemical tests):
   1 - fixed acidity (tartaric acid - g / dm^3)
   2 - volatile acidity (acetic acid - g / dm^3)
   3 - citric acid (g / dm^3)
   4 - residual sugar (g / dm^3)
   5 - chlorides (sodium chloride - g / dm^3)
   6 - free sulfur dioxide (mg / dm^3)
   7 - total sulfur dioxide (mg / dm^3)
   8 - density (g / cm^3)
   9 - pH
   10 - sulphates (potassium sulphate - g / dm^3)
   11 - alcohol (% by volume)
   
   Output variable (based on sensory data): 
   12 - quality (score between 0 and 10)

Description of attributes:

   1 - fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)
   2 - volatile acidity: the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste
   3 - citric acid: found in small quantities, citric acid can add 'freshness' and flavor to wines
   4 - residual sugar: the amount of sugar remaining after fermentation stops, it's rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet
   5 - chlorides: the amount of salt in the wine
   6 - free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
   7 - total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
   8 - density: the density of wine is close to that of water, depending on the percent alcohol and sugar content
   9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
   10 - sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
   11 - alcohol: the percent alcohol content of the wine

What is/are the main feature(s) of interest in your dataset?

The main feature of interest in the data set is “quality”. I’d like to determine which features are best for predicting the “quality” of a wine. I suspect “volatile acidity”, “pH”, and some combination of the other variables can be used to build a predictive model for “quality”.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I suspect the “volatile acidity”, “pH”, and “residual sugar” variables can help in building a predictive model for “quality”.

Did you create any new variables from existing variables in the dataset?

I created “log_sulphates”, a log transform that brings the feature closer to a normal distribution, so that it can be used more effectively in a prediction model (linear regression).

Additionally, I transformed wine quality into a categorical variable. Wine quality is a discrete value, so it can be converted from numerical to categorical data.
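
A sketch of how these derived variables could be built. The data frame `rw_sample` is a hypothetical stand-in for a few rows of `rw`, and the names `quality_factor` and `quality_group` are illustrative. Only “log_sulphates” is named in the report, and the choice of base 10 is an assumption.

```r
# Hypothetical stand-in for a few rows of rw.
rw_sample <- data.frame(
  sulphates = c(0.33, 0.55, 0.62, 0.73, 2.00),
  quality   = c(3, 5, 6, 7, 8)
)

# Log-transformed sulphates (base 10 is an assumption).
rw_sample$log_sulphates <- log10(rw_sample$sulphates)

# Quality as an ordered categorical variable, one level per grade.
rw_sample$quality_factor <- factor(rw_sample$quality, ordered = TRUE)

# Coarser grouping: low (3, 4), medium (5, 6), high (7, 8).
rw_sample$quality_group <- cut(
  rw_sample$quality,
  breaks = c(2, 4, 6, 8),
  labels = c("low", "medium", "high"),
  ordered_result = TRUE
)
print(rw_sample)
```

Keeping the factor ordered preserves the fact that grade 7 is better than grade 5, which an unordered factor would discard.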

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

This data set is tidy, so no data wrangling was needed. However, the histograms of fixed acidity, citric acid, free sulfur dioxide, total sulfur dioxide, sulphates, and alcohol are all right-skewed with a long tail. I applied log and square root transformations to better understand these distributions.

Bivariate Plots Section

A plot matrix shows the relationships between variables at a glance. We look for correlations between wine quality and each of the other properties.

From the plot matrix above, the five features most correlated with “quality” are:

Feature                 r-value
alcohol                   0.476
volatile.acidity         -0.391
sulphates                 0.251
citric.acid               0.226
total.sulfur.dioxide     -0.185

The “alcohol” feature has the strongest correlation with wine quality. Higher-quality wines tend to have higher alcohol content.
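
A ranking like the one above can be produced by correlating every feature with quality and sorting by absolute value. This is a sketch on synthetic data: `toy` is a hypothetical stand-in for `rw`, so the r-values it prints are not the ones in the table.

```r
set.seed(1)
n <- 200
# Hypothetical stand-in for rw: synthetic features, with a quality
# score that depends on alcohol and volatile acidity only.
toy <- data.frame(
  alcohol          = rnorm(n, mean = 10.4, sd = 1),
  volatile.acidity = rnorm(n, mean = 0.53, sd = 0.18),
  pH               = rnorm(n, mean = 3.31, sd = 0.15)
)
toy$quality <- 5.6 + 0.4 * (toy$alcohol - 10.4) -
  1.2 * (toy$volatile.acidity - 0.53) + rnorm(n, sd = 0.6)

# Pearson correlation of every feature with quality.
features <- setdiff(names(toy), "quality")
r_values <- sapply(toy[features], cor, y = toy$quality)

# Rank by absolute value so that strong negative correlations
# are not pushed to the bottom of the list.
ranked <- r_values[order(abs(r_values), decreasing = TRUE)]
print(round(ranked, 3))
```

Sorting on `abs(r_values)` is what places a negative correlation such as that of volatile.acidity ahead of weaker positive ones.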

Compare quality

Boxplots

  • We can clearly see that alcohol has the strongest correlation with quality.
  • Higher sulphate levels go with higher wine quality. However, they should stay below 1.0.
  • SO2 values mostly lie below 50 mg/dm^3.
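
The comparison a boxplot makes can also be read off as per-grade medians. The sketch below uses a tiny hypothetical data frame `toy` in place of `rw`.

```r
# Hypothetical stand-in for rw with a handful of wines.
toy <- data.frame(
  quality = c(5, 5, 5, 6, 6, 6, 7, 7, 7),
  alcohol = c(9.4, 9.8, 9.5, 10.2, 10.5, 11.0, 11.5, 12.0, 11.2)
)

# Median alcohol per quality grade: the centre line of each box.
median_by_grade <- tapply(toy$alcohol, toy$quality, median)
print(median_by_grade)

# One box per grade.
boxplot(alcohol ~ quality, data = toy,
        xlab = "quality", ylab = "alcohol (% by volume)")
```

In this toy data the medians are 9.5, 10.5 and 11.5 for grades 5, 6 and 7, which is the rising pattern the alcohol boxplots are read for.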

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

There is a moderate positive relationship between alcohol and quality. The other features don’t seem to be related to quality as strongly as alcohol.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

There are weaker but still noticeable relationships between quality and volatile.acidity, sulphates, citric.acid, and total.sulfur.dioxide. The remaining features don’t seem to affect quality.

What was the strongest relationship you found?

The strongest relationship is between alcohol and quality, with an r-value of 0.476.

Multivariate Plots Section

Top 2 Main chemical property vs wine quality

To add more variables to the analysis, we use color as an additional dimension. There are 5 main features. Let’s start with the first 2, “alcohol” and “volatile acidity”.

We can clearly see that higher-quality wines have higher alcohol and lower volatile acidity.

Add Sulphates dimension

Higher-quality wines have higher alcohol (x-axis), lower volatile acidity (y-axis), and higher sulphates (red color).

Other variables

Most high-quality wines contain between 0.25 and 0.75 g/dm^3 of citric acid. We can see that higher-quality wines have higher alcohol (x-axis), lower citric acid (y-axis), and lower total sulfur dioxide (purple color).

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Wine quality correlates most strongly with alcohol, followed by four other variables: “volatile.acidity”, “sulphates”, “citric.acid”, and “total.sulfur.dioxide”.

  • At an equal level of alcohol, wines with lower volatile acidity tend to be of higher quality.
  • This is another confirmation of the influence of sulphates. High-quality wines mostly seem to contain no more than 50 mg/dm^3 of total SO2.
  • High-quality wines are generally below 0.4 “volatile acidity”. At the same time, they show a larger dispersion of sulphates than the other wines.

The relationship between quality and alcohol looks linear.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

Linear Model

A multivariable linear model was created to predict wine quality from the chemical properties. Features were added in order of the strength of their correlation with wine quality.

## 
## Calls:
## m1: lm(formula = quality ~ alcohol, data = rw)
## m2: lm(formula = quality ~ alcohol + volatile.acidity, data = rw)
## m3: lm(formula = quality ~ alcohol + volatile.acidity + sulphates, 
##     data = rw)
## m4: lm(formula = quality ~ alcohol + volatile.acidity + sulphates, 
##     data = rw)
## m5: lm(formula = quality ~ alcohol + volatile.acidity + sulphates + 
##     citric.acid, data = rw)
## m6: lm(formula = quality ~ alcohol + volatile.acidity + sulphates + 
##     citric.acid + total.sulfur.dioxide, data = rw)
## 
## =================================================================================
##                          m1        m2        m3        m4        m5        m6    
## ---------------------------------------------------------------------------------
## (Intercept)            1.875***  3.095***  2.611***  2.611***  2.646***  2.843***
##                       (0.175)   (0.184)   (0.196)   (0.196)   (0.201)   (0.205)  
## alcohol                0.361***  0.314***  0.309***  0.309***  0.309***  0.295***
##                       (0.017)   (0.016)   (0.016)   (0.016)   (0.016)   (0.016)  
## volatile.acidity                -1.384*** -1.221*** -1.221*** -1.265*** -1.222***
##                                 (0.095)   (0.097)   (0.097)   (0.113)   (0.112)  
## sulphates                                  0.679***  0.679***  0.696***  0.721***
##                                           (0.101)   (0.101)   (0.103)   (0.103)  
## citric.acid                                                   -0.079    -0.043   
##                                                               (0.104)   (0.104)  
## total.sulfur.dioxide                                                    -0.002***
##                                                                         (0.001)  
## ---------------------------------------------------------------------------------
## R-squared                 0.227     0.317     0.336     0.336     0.336     0.344
## adj. R-squared            0.226     0.316     0.335     0.335     0.334     0.342
## sigma                     0.710     0.668     0.659     0.659     0.659     0.655
## F                       468.267   370.379   268.912   268.912   201.777   166.962
## p                         0.000     0.000     0.000     0.000     0.000     0.000
## Log-likelihood        -1721.057 -1621.814 -1599.384 -1599.384 -1599.093 -1589.749
## Deviance                805.870   711.796   692.105   692.105   691.852   683.814
## AIC                    3448.114  3251.628  3208.768  3208.768  3210.186  3193.499
## BIC                    3464.245  3273.136  3235.654  3235.654  3242.448  3231.138
## N                      1599      1599      1599      1599      1599      1599    
## =================================================================================

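Because we do not need the model to predict quality in the future, we can use the whole data set to fit it and look at the “R-squared” value. Model m6, with 5 features, has the highest R-squared. R-squared never decreases as features are added, so adjusted R-squared is the fairer comparison: it dips slightly from m4 to m5 (0.335 to 0.334) when citric.acid is added, and is highest for m6 (0.342).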
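
The effect of adding features on R-squared can be reproduced with nested `lm` fits. This is a sketch on synthetic data, not the models from the table: `toy` is a hypothetical stand-in for `rw`, and `noise` is a made-up feature with no relation to quality.

```r
set.seed(7)
n <- 300
# Hypothetical stand-in for rw; "noise" is unrelated to quality.
toy <- data.frame(
  alcohol          = rnorm(n, mean = 10.4, sd = 1),
  volatile.acidity = rnorm(n, mean = 0.53, sd = 0.18),
  noise            = rnorm(n)
)
toy$quality <- 2.8 + 0.3 * toy$alcohol -
  1.2 * toy$volatile.acidity + rnorm(n, sd = 0.65)

# Nested models: each one adds a predictor to the previous one.
m1 <- lm(quality ~ alcohol, data = toy)
m2 <- update(m1, . ~ . + volatile.acidity)
m3 <- update(m2, . ~ . + noise)

# Collect the fit statistics side by side.
fits <- list(m1 = m1, m2 = m2, m3 = m3)
comparison <- data.frame(
  r_squared     = sapply(fits, function(m) summary(m)$r.squared),
  adj_r_squared = sapply(fits, function(m) summary(m)$adj.r.squared),
  AIC           = sapply(fits, AIC)
)
print(round(comparison, 3))
```

R-squared cannot go down from m2 to m3 even though `noise` carries no information, which is why adjusted R-squared or AIC is usually the safer basis for comparing models of different size.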

The model can be described as:

wine_quality = 2.843 + 0.295 * alcohol - 1.222 * volatile.acidity + 0.721 * sulphates - 0.043 * citric.acid - 0.002 * total.sulfur.dioxide

R-squared: 0.344
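
As a worked example of the equation, the sketch below plugs in a hypothetical wine whose features all sit at the medians reported by `summary(rw)`. The coefficients are copied from the m6 column of the regression table; the variable names are illustrative.

```r
# Coefficients of model m6, copied from the regression table.
coefficients_m6 <- c(
  intercept            =  2.843,
  alcohol              =  0.295,
  volatile.acidity     = -1.222,
  sulphates            =  0.721,
  citric.acid          = -0.043,
  total.sulfur.dioxide = -0.002
)

# Hypothetical wine at the median of each feature (from summary(rw));
# the intercept term is multiplied by 1.
median_wine <- c(
  intercept            = 1,
  alcohol              = 10.2,
  volatile.acidity     = 0.52,
  sulphates            = 0.62,
  citric.acid          = 0.26,
  total.sulfur.dioxide = 38
)

# Predicted quality is the sum of coefficient * value over all terms.
predicted_quality <- sum(coefficients_m6 * median_wine[names(coefficients_m6)])
print(round(predicted_quality, 2))
```

The result is about 5.58, which falls between the mean (5.636) and median (6) quality in the data, as expected for a wine with typical values.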

I think this R-squared is too low, so the model probably can’t be used in a production system. We should try another model, such as binomial regression.


Final Plots and Summary

Plot One

Description One

We can clearly see that the distribution of wine quality is uneven. The data has many wines of medium quality (grades 5, 6), but far fewer of low (grades 3, 4) and high (grades 7, 8) quality.

Plot Two

Description Two

The 5 features with the highest correlation coefficients with quality are alcohol, volatile acidity, sulphates, citric acid, and total sulfur dioxide. Wine quality is grouped into low (3, 4), medium (5, 6), and high (7, 8). High-quality wines have a high alcohol level; however, there is no significant difference between medium- and low-quality wines. Citric acid and sulphates increase as wine quality increases. Volatile acidity decreases as wine quality increases.

Plot Three

Description Three

Scatter plot of the top 4 features. Two features are plotted, with color indicating wine quality. The same trend as in the previous figure can be observed. In general, high-quality wines tend to have higher alcohol and lower volatile acidity content. They also tend to have higher sulphate and higher citric acid content.


Reflection

The red wine data set contains information about 1,599 red wines. I started with univariate analysis. I analysed the impact of the alcohol, volatile.acidity, sulphates, citric.acid, and total.sulfur.dioxide features on the quality of the red wines. I found a few interesting results, especially regarding the impact of alcohol on quality.

Then, I moved to bivariate analysis. I tried various combinations of the variables in the data set and tried to analyse their impact on the quality of the wines. After that, I used various techniques of multivariate analysis to analyse the impact of the variables on the red wines.

I created a linear model and included it in my analysis, but I think it should not be used in a production system because of its low R-squared.

Possible future research:

For future exploration of this data, I would like to take one category of wine (for example, quality level 7 or 8) and look at the patterns that appear within each quality level. Additionally, it would be good to get more features about red wine.

EDA is really exciting, and the research can take a lot of time.
